Efficient Deep Web Crawling Using Reinforcement Learning
Abstract
The deep web is the hidden part of the Web that standard web crawlers cannot reach. Obtaining deep web content is challenging, and this difficulty has been acknowledged as a significant gap in the coverage of search engines. To address it, the paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and the deep web database as the environment. The agent perceives its current state and selects an action (a query) to submit to the environment according to its Q-value. The framework not only enables the crawler to learn a promising crawling strategy from its own experience, but also allows it to use diverse features of query keywords. Experimental results show that the method outperforms state-of-the-art methods in crawling capability and overcomes the assumption of full-text search implied by existing methods.
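The agent, environment, and Q-value loop described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the toy in-memory `database`, the reward (fraction of records that are new), the two keyword features in `features`, the linear Q-function, and all parameter names are assumptions chosen only to make the idea concrete.

```python
import random


def features(keyword, harvested, database):
    """Hypothetical features: a bias term and the keyword's frequency
    among the records the crawler has already harvested."""
    if not harvested:
        return [1.0, 0.0]
    df = sum(1 for i in harvested if keyword in database[i])
    return [1.0, df / len(harvested)]


def q_value(weights, phi):
    """Linear Q-value estimate: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, phi))


def crawl(database, candidates, steps, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Issue up to `steps` queries, choosing each one epsilon-greedily by Q-value."""
    rng = random.Random(seed)
    weights = [0.0, 0.0]
    harvested = set()                    # ids of records retrieved so far (the state)
    remaining = sorted(set(candidates))  # queries not yet issued (the actions)
    issued = []
    for _ in range(steps):
        if not remaining:
            break
        if rng.random() < epsilon:
            action = rng.choice(remaining)  # explore
        else:
            action = max(  # exploit: highest estimated Q-value
                remaining,
                key=lambda kw: q_value(weights, features(kw, harvested, database)),
            )
        phi = features(action, harvested, database)
        # Environment step: the database returns every record containing the keyword.
        result = {i for i, record in enumerate(database) if action in record}
        reward = len(result - harvested) / len(database)  # assumed reward signal
        harvested |= result
        remaining.remove(action)
        issued.append(action)
        # One-step Q-learning update on the linear weights.
        best_next = max(
            (q_value(weights, features(kw, harvested, database)) for kw in remaining),
            default=0.0,
        )
        td_error = reward + gamma * best_next - q_value(weights, phi)
        weights = [w + alpha * td_error * x for w, x in zip(weights, phi)]
    return issued, harvested, weights


if __name__ == "__main__":
    toy_db = [{"deep", "web"}, {"web", "crawler"}, {"query", "agent"}, {"agent", "reward"}]
    queries, records, learned = crawl(toy_db, ["web", "agent", "deep", "reward"], steps=3)
    print(queries, sorted(records), learned)
```

Because the Q-value is computed from keyword features rather than stored per keyword, what the crawler learns from one query carries over to queries it has not yet issued, which is one simple way to read the abstract's remark about utilizing features of query keywords.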
Similar resources
Learning to Surface Deep Web Content
We propose a novel deep web crawling framework based on reinforcement learning. The crawler is regarded as an agent and deep web database as the environment. The agent perceives its current state and submits a selected action (query) to the environment according to Q-value. Based on the framework we develop an adaptive crawling method. Experimental results show that it outperforms the state of ...
Expanding Reinforcement Learning Approaches for Efficient Crawling the Web
The amount of accessible information on the World Wide Web is increasing rapidly, so a general-purpose search engine cannot index everything on the Web. Focused crawlers have been proposed as a potential approach to overcome the coverage problem of search engines by limiting their domain of concentration. Focused crawling is a technique which is able to crawl particular topical portions ...
FICA: A novel intelligent crawling algorithm based on reinforcement learning
The web is a huge and highly dynamic environment which is growing exponentially in content and developing fast in structure. No search engine can cover the whole web, thus it has to focus on the most valuable pages for crawling. So an efficient crawling algorithm for retrieving the most important pages remains a challenging issue. Several algorithms like PageRank and OPIC have been proposed. Un...
A Framework for Deep Web Crawler Using Genetic Algorithm
The Web has become one of the largest and most readily accessible repositories of human knowledge. Traditional search engines index only the surface Web, whose pages are easily found. The focus has now moved to the invisible Web or hidden Web, which consists of a large warehouse of useful data such as images, sounds, presentations and many other types of media. To use such data, there is a need...
Teaching Reinforcement Learning using a Physical Robot
This paper presents a little crawling robot as a didactic instrument for teaching reinforcement learning. The robot learns a forward-walking policy from scratch in less than 20 seconds of reinforced sensorimotor interactions. The state space consists of two discretized dimensions, where the behavior is visualizable and comprehensible. In laboratory tutorials, students conduct experiments with a ...